METHOD AND SYSTEM OF CORRECTING SPECTRAL
DEFORMATIONS IN THE VOICE, INTRODUCED BY A
COMMUNICATION NETWORK



BACKGROUND OF THE INVENTION 
Field of the invention 

The invention concerns a method for the
multireference correction of voice spectral
deformations introduced by a communication network. It
also concerns a system for implementing the method.

The aim of the present invention is to improve the
quality of the speech transmitted over communication
networks, by offering means for correcting the spectral
deformations of the speech signal, deformations caused
by various links in the network transmission chain.

The description given hereinafter explicitly refers
to the transmission of speech over "conventional"
(that is to say cabled) telephone lines, but also
applies to any type of communication network (fixed,
mobile or other) introducing spectral deformations into
the signal, the parameters taken as a reference for
specifying the network having to be modified according
to the network.
Description of prior art 

The various deformations encountered in the case
of the switched telephone network (STN) are described
below.

1.1. Degradations in the timbre of the voice on
the STN network:

Figure 1 depicts a diagram of an STN connection.
The speech emitted by a speaker is transmitted by a
sending terminal 10, is transported by the subscriber
line 20, undergoes an analogue to digital conversion 30
(A-law), is transmitted by the digital network 40,
undergoes a digital (A-law) to analogue conversion 50,
is transmitted by the subscriber line 60, and passes
through the receiving terminal 70 in order finally to
be received by the destination person.

Each speaker is connected by an analogue line
(twisted pair) to the closest telephone exchange. This
is a base band analogue transmission referenced 1 and 3
in Figure 1. The connection between the exchanges
follows an entirely digital network. The spectrum of
the voice is affected by two types of distortion during
the analogue transmission of the base band signal.

The first type of distortion is the bandwidth
filtering of the terminals and the points of access to
the digital part of the network. The typical
characteristics of this filtering are described by
UIT-T under the name "intermediate reference system"
(IRS) (UIT-T, Recommendation P.48, 1988). These
frequency characteristics, resulting from measurements
made during the 1970s, are however tending to become
obsolete. This is why the UIT-T has recommended since
1996 using a "modified" IRS (UIT-T, Recommendation
P.830, 1996), the nominal characteristic of which is
depicted in Figure 2 for the transmission part and in
Figure 3 for the receiving part. Between 200 and 3400
Hz, the tolerance is ± 2.5 dB; below 200 Hz, the
decrease in the characteristic of the global system
must be at least 15 dB per octave. The transmission and
reception parts of the IRS are called respectively,
according to the UIT-T terminology, the "transmitting
system" and the "receiving system".

The second distortion affecting the voice spectrum
is the attenuation of the subscriber lines. In a simple
model of the local analogue line (given in a CNET
Technical Note NT/LAA/ELR/289 by Cadoret, 1983), it is
considered that the line introduces an attenuation of
the signal whose value in dB depends on the line length
and is proportional to the square root of the
frequency. The attenuation is 3 dB at 800 Hz for an
average line (approximately 2 km) and 9.5 dB at 800 Hz
for longer lines (up to 10 km). According to this
model, the expression for the attenuation of a line,
depicted in Figure 4, is:

(0.1)
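As a rough numerical illustration of this line model, the sketch below scales a reference attenuation given at 800 Hz by the square root of the frequency ratio. The function name `line_attenuation_db` and the closed form A(f) = A_800 * sqrt(f / 800) are assumptions derived only from the two properties stated above (proportionality to the square root of the frequency, and the values quoted at 800 Hz); the sketch is not a transcription of equation (0.1).

```python
import math


def line_attenuation_db(freq_hz, atten_800_db):
    """Attenuation in dB at freq_hz, assuming a sqrt(f) law anchored at 800 Hz.

    atten_800_db is the attenuation of the line at 800 Hz, for example
    3.0 for an average line or 9.5 for a long line (values quoted in the text).
    """
    return atten_800_db * math.sqrt(freq_hz / 800.0)


# Attenuation of an average line at the upper edge of the telephone band.
average_line_at_3400 = line_attenuation_db(3400.0, 3.0)
```

Under this assumed law, quadrupling the frequency doubles the attenuation in dB, which is why long lines mainly dull the upper part of the telephone band.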



To these distortions there is added the
anti-aliasing filtering of the MIC (PCM) coder
(ref. 30). The latter is typically a 200-3400 Hz
bandpass filter with a response which is almost flat
over the bandwidth and high attenuation outside the
band, according to the template in Figure 5 for example
(National Semiconductor, August 1994: Technical
Documentation TP3054, TP3057).
Finally, the voice suffers spectral distortion as
depicted in Figure 6 for the various combinations of
three types of analogue line in transmission and
reception (that is to say 6 distortions), assuming
equipment complying with the nominal characteristic of
the modified IRS. The voice thus appears to be muffled
if one of the analogue lines is long and in all cases
suffers from a lack of "presence" due to the
attenuation of the low-frequency components.

1.2. Degradations in the timbre of the voice on
the ISDN network and the GSM mobile network

In ISDN and the GSM network, the signal is
digitised as from the terminal. The only analogue parts
are the transmission and reception transducers
associated with their respective amplification and
conditioning chains. The UIT-T has defined frequency
sensitivity templates for transmission, depicted in
Figure 7, and for reception, depicted in Figure 8,
valid both for cabled digital telephones (UIT-T,
Recommendation P.310, May 2000) and mobile digital or
wireless terminals (UIT-T, Recommendation P.313,
September 1999).

Moreover, for GSM networks, it is recognised that
coding and decoding slightly modify the spectral
envelope of the signal. This alteration is shown in
Figure 9 for pink noise coded and then decoded in EFR
(Enhanced Full Rate) mode.

The effect of these filterings on the timbre is
mainly an attenuation of the low-frequency components,
less marked however than in the case of STN.
The invention concerns the correction of these
spectral distortions by means of a centralized
processing, that is to say a device
installed in the digital part of the network, as
indicated in Figure 10 for the STN.

The objective of a correction of the voice timbre 
is that the voice timbre in reception is as close as 
possible to that of the voice emitted by the speaker, 
which will be termed the original voice. 

2. Prior art

Compensation for the spectral distortions
introduced into the speech signal by the various
elements of the telephone connection is at the present
time provided by equalization-based devices. The
equalization can be fixed or be adapted according to
the transmission conditions.

2.1. Fixed equalization

Centralised equalization devices were proposed in
the patents US 5333195 (Duane O. Bowker) and US 5471527
(Helena S. Ho). These equalizers are fixed filters
which restore the level of the low frequencies
attenuated by the transmitter. Bowker proposes for
example a gain of 10 to 15 dB on the 100-300 Hz band.
These methods have two drawbacks:

* The equalizer compensates only for the filtering of
the transmitter, so that on reception the low-frequency
components remain greatly attenuated by the IRS
reception filtering.

* This fixed equalization compensates for the average
transmission conditions (transmission system and line).
If the actual conditions are too different (for example
if the analogue lines are long), the device does not
sufficiently correct the timbre, or even impairs it
more than the connection without equalization.

2.2. Adaptive equalization

The invention described in the patent US 5915235
(Andrew P. De Jaco) aims to correct the non-ideal
frequency response of a mobile telephone transducer.
The equalizer is described as being placed between the
analogue to digital converter and the CELP coder but
can equally well be in the terminal or in the network.
The principle of equalization is to bring the spectrum
of the received signal close to an ideal spectrum. Two
methods are proposed.

The first method (illustrated by Figure 4 in the
aforementioned patent of De Jaco) consists of
calculating long-term autocorrelation coefficients
R_LT:

$$R_{LT}(n,i) = \alpha\, R_{LT}(n-1,i) + (1-\alpha)\, R(n,i) \qquad (0.2)$$

with R_LT(n,i) the i-th long-term autocorrelation
coefficient at the n-th frame, R(n,i) the i-th
autocorrelation coefficient specific to the n-th frame,
and α a smoothing constant fixed for example at 0.995.
From these coefficients there are derived the long-term
LPC coefficients, which are the coefficients of a
whitening filter. At the output of this filter, the
signal is filtered by a fixed filter which imprints on
it the ideal long-term spectral characteristics, i.e.
those which it would have at the output of a transducer
having the ideal frequency response. These two filters
are supplemented by a multiplicative gain equal to the
ratio between the long-term energies of the input of
the whitener and of the output of the second filter.
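The recursion (0.2) is a first-order exponential smoothing of the per-frame autocorrelation. A minimal sketch follows; the function names `frame_autocorrelation` and `update_long_term` are hypothetical, and the derivation of LPC coefficients from the smoothed values is deliberately left out.

```python
def frame_autocorrelation(frame, order):
    """Autocorrelation coefficients R(n, i), i = 0..order, of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t - i] for t in range(i, n))
        for i in range(order + 1)
    ]


def update_long_term(r_lt_prev, r_frame, alpha=0.995):
    """One step of R_LT(n, i) = alpha * R_LT(n-1, i) + (1 - alpha) * R(n, i)."""
    return [alpha * prev + (1.0 - alpha) * cur
            for prev, cur in zip(r_lt_prev, r_frame)]


# Usage: start from the first frame, then smooth over the following frames.
frames = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.25]]
r_lt = frame_autocorrelation(frames[0], order=1)
for frame in frames[1:]:
    r_lt = update_long_term(r_lt, frame_autocorrelation(frame, order=1))
```

With α = 0.995 each new frame contributes only 0.5 % of the estimate, so the smoothed coefficients follow the speaker and the channel rather than individual phonemes.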

The second method, illustrated by Figure 5 of the
aforementioned De Jaco patent, consists of dividing the
signal into sub-bands and, for each sub-band, applying
a multiplicative gain so as to reach a target energy,
this gain being defined as the ratio between the target
energy of the sub-band and the long-term energy
(obtained by a smoothing of the instantaneous energy)
of the signal in this sub-band.
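The sub-band method can be sketched in a few lines. The names `smooth_energy` and `subband_gains`, the exponential form of the smoothing, its constant and the small `floor` guard against division by zero are all assumptions made for illustration; the text above only states that the gain is the ratio of target energy to smoothed long-term energy.

```python
def smooth_energy(long_term_prev, instantaneous, alpha=0.995):
    """Long-term energy of one sub-band, by smoothing the instantaneous energy."""
    return alpha * long_term_prev + (1.0 - alpha) * instantaneous


def subband_gains(target_energies, long_term_energies, floor=1e-12):
    """Per-sub-band gain: target energy divided by long-term energy."""
    return [target / max(energy, floor)
            for target, energy in zip(target_energies, long_term_energies)]


# Two sub-bands: the first is too weak (gain > 1), the second too strong.
gains = subband_gains([4.0, 1.0], [2.0, 4.0])
```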

These two methods have the drawback of correcting
only the non-ideal response of the transmission system
and not that of the reception system.

The object of the device of the patent US 5905969
(Chafik Mokbel) is to compensate for the filtering of
the transmission system and of the subscriber line in
order to improve the centralised recognition of the
speech and/or the quality of the speech transmitted. As
presented by Figure 3a in Mokbel, the spectrum of the
signal is divided into 24 sub-bands and each sub-band
energy is multiplied by an adaptive gain. The
adaptation of the gain is achieved according to the
stochastic gradient algorithm, by minimisation of the
square error, the error being defined as the difference
between the sub-band energy and a reference energy
defined for each sub-band. The reference energy is
modulated for each frame by the energy of the current
frame, so as to respect the natural short-term
variations in level of the speech signal. The
convergence of the algorithm makes it possible to
obtain as an output the 24 equalized sub-band signals.

If the application aimed at is the improvement in
the voice quality, the equalized speech
signal is obtained by inverse Fourier transform of the
equalized sub-band energy.

The Mokbel patent does not mention any results in
terms of improvement in the voice quality, and
recognises that the method is sub-optimal, in that it
uses a circular convolution. Moreover, it is doubtful
that a speech signal can be reconstructed correctly by
the inverse Fourier transform of band energies
distributed according to the MEL scale. Finally, the
device described does not correct the filtering of the
reception system and of the analogue reception line.

The compensation for the line effect is achieved
in the "Mokbel" method of cepstral subtraction, for the
purpose of improving the robustness of the speech
recognition. It is shown that the cepstrum of the
transmission channel can be estimated by means of the
mean cepstrum of the signal received, the latter first
being whitened by a pre-emphasis filter. This
method affords a clear improvement in the performance
of the recognition systems but is considered to be an
"off-line" method, 2 to 4 seconds being necessary for
estimating the mean cepstrum.

2.3. Another state of the art combines a fixed
pre-equalization with an adapted
equalization and has been the subject of
the filing of a patent application FR 2822999 by the
applicant. The device described aims to correct the
timbre of the voice by combining two filters.

A fixed filter, called the pre-equalizer,
compensates for the distortions of an
average telephone line, defined as consisting of two
average subscriber lines and transmission and reception
systems complying with the nominal frequency responses
defined in UIT-T, Recommendation P.48, App. I, 1988. Its
frequency response on the Fc-3150 Hz band is the
inverse of the global response of the analogue part of
this average connection, Fc being the lower limit
frequency of the equalization band.

This pre-equalization is
supplemented by an adapted equalizer, which
adapts the correction more precisely to the actual
transmission conditions. The frequency response of the
adapted equalizer is given by:

with L_RX the frequency response of the reception
line, S_RX the frequency response of the reception
system and γ_x(f) the long-term spectrum of the output x
of the pre-equalizer.

The long-term spectrum is defined by the temporal
mean of the short-term spectra of the successive frames
of the signal; γ_ref(f), referred to as the reference
spectrum, is the mean spectrum of the speech defined by
the UIT (UIT-T/P.50/App. I, 1998), taken as an
approximation of the original long-term spectrum of the
speaker. Because of this approximation, the frequency
response of the adapted equalizer is very
irregular and only its general shape is pertinent. This
is why it must be smoothed. The adapted equalizer
being produced in the form of a time-domain FIR (RIF)
filter, this smoothing in the frequency domain is
obtained by a narrow (symmetrical) windowing of the
impulse response.
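The definition of the long-term spectrum as a temporal mean of short-term spectra can be sketched as follows. This is only an illustration: taking the squared DFT modulus as the short-term spectrum, the absence of any analysis window or frame overlap, and the names `power_spectrum` and `long_term_spectrum` are assumptions not stated in the text. A naive DFT is used so that the sketch needs only the standard library.

```python
import cmath
import math


def power_spectrum(frame):
    """Squared modulus of the DFT of one frame (naive O(n^2) implementation)."""
    n = len(frame)
    spectrum = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def long_term_spectrum(frames):
    """Bin-by-bin mean of the short-term spectra of equal-length frames."""
    spectra = [power_spectrum(frame) for frame in frames]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


# Two impulse frames of different level: flat spectra of power 1 and 4.
lts = long_term_spectrum([[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]])
```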

This method makes it possible to restore a timbre
close to that of the original signal on the
equalization band (Fc-3150 Hz), but:

- for some speakers, the approximation of their
original long-term spectrum by means of the reference
spectrum is very rough, so that the equalizer
introduces a perceptible distortion;

- the high smoothing of the frequency response of
the equalizer, made necessary by the
approximation error, prevents fine spectral distortions
from being corrected.

SUMMARY OF THE INVENTION 

The aim of the invention is to remedy the
drawbacks of the prior art. Its object is a method and
system for improving the correction of the timbre by
reducing the approximation error in the original
long-term spectrum of the speakers.

To this end, it is proposed to classify the
speakers according to their long-term spectrum and to
approximate this not by a single reference spectrum but
by one reference spectrum per class. The method
proposed makes it possible to carry out an equalization
processing able to determine the class of
the speaker and to equalize according to the
reference spectrum of the class. This reduction in the
approximation error makes it possible to smooth the
frequency response of the adapted equalizer
less strongly, making it able to correct finer spectral
distortions.

The object of the present invention is more
particularly a method of correcting spectral
deformations in the voice, introduced by a
communication network, comprising an operation of
equalization on a frequency band (F1-F2),
adapted to the actual distortion of the transmission
chain, this operation being performed by means of a
digital filter having a frequency response which is a
function of the ratio between a reference spectrum and
a spectrum corresponding to the long-term spectrum of
the voice signal of the speakers, principally
characterised in that it comprises:

* prior to the operation of equalization of the voice
signal of a speaker communicating:

- the constitution of classes of speakers with one
voice reference per class,

* then, for a given speaker communicating:

- the classification of this speaker, that is to
say his allocation to a class from predefined
classification criteria in order to make the voice
reference which is closest to his own correspond to
him,

- the equalization of the digitised
signal of the voice of the speaker carried out with, as
a reference spectrum, the voice reference of the class
to which the said speaker has been allocated.

According to another characteristic, the
constitution of classes of speakers comprises:

- the choice of a corpus of N speakers recorded
under non-degraded conditions and the determination of
their long-term frequency spectrum,

- the classification of the speakers in the corpus
according to their partial cepstrum, that is to say the
cepstrum calculated from the long-term spectrum
restricted to the equalization band (F1-F2),
by applying a predefined classification criterion
to these cepstra in order to obtain K classes,

- the calculation of the reference spectrum
associated with each class so as to obtain a voice
reference corresponding to each of the classes.

According to another characteristic, the reference
spectrum on the equalization frequency
band (F1-F2), associated with each class, is calculated
by Fourier transform of the center of the class
defined by its partial cepstrum.

According to another characteristic, the
classification of a speaker comprises:

- use of the mean pitch of the voice signal and of
the partial cepstrum of this signal as classification
parameters,

- the application of a discriminating function to
these parameters in order to classify the said speaker.

According to the invention the method also
comprises a step of pre-equalization
of the digital signal by a fixed filter having a
frequency response in the frequency band (F1-F2),
corresponding to the inverse of a reference spectral
deformation introduced by the telephone connection.

According to another characteristic, the
equalization of the digitised signal of
the voice of a speaker comprises:

- the detection of a voice activity on the line in
order to trigger a concatenation of processings
comprising the calculation of the long-term spectrum,
the classification of the speaker, the calculation of
the modulus of the frequency response of the equalizer
filter restricted to the equalization
band (F1-F2) and the calculation of the
coefficients of the digital filter differentiated
according to the class of the speaker, from this
modulus,

- the control of the filter with the coefficients
obtained,

- the filtering of the signal emerging from the
pre-equalizer by the said filter.

According to another characteristic, the
calculation of the modulus (EQ) of the frequency
response of the equalizer filter restricted
to the equalization band (F1-F2) is
achieved by the use of the following equation:

(0.3)
in which γ_ref(f) is the reference spectrum of the
class to which the said speaker belongs,

and in which L_RX is the frequency response of the
reception line, S_RX is the frequency response of the
reception system and γ_x(f) the long-term spectrum of the
input signal x of the filter.

According to a variant, the calculation of the
modulus of the frequency response of the equalizer
filter restricted to the equalization
band (F1-F2) is done using the following
equation:

in which C^p_eq, C^p_x, C^p_S_RX and C^p_L_RX are the
respective partial cepstra of the adapted equalizer,
of the input signal x of the equalizer
filter, of the reception system and of the
reception line, C^p_ref being the reference partial
cepstrum, the center of the class of the
speaker. The modulus (EQ) restricted to the band F1-F2
is then calculated by discrete Fourier transform of
C^p_eq.

Another object of the invention is a system for
correcting voice spectral deformations introduced by a
communication network, comprising adapted equalization
means in a frequency band (F1-F2) which
comprise a digital filter whose frequency response is a
function of the ratio between a reference spectrum and
a spectrum corresponding to the long-term spectrum of a
voice signal, principally characterised in that these
means also comprise:

- means of processing the signal for calculating
the coefficients of the digital filter provided with:

• a first signal processing unit for calculating the
modulus of the frequency response of the
equalizer filter restricted to the
equalization band (F1-F2) according
to the following equation:

in which γ_ref(f) is the reference spectrum, which
may be different from one speaker to another and which
corresponds to a reference for a predetermined class
to which the said speaker belongs, and in which L_RX
is the frequency response of the reception line, S_RX
the frequency response of the reception system and
γ_x(f) the long-term spectrum of the input signal x of
the filter;

• a second processing unit for calculating the
impulse response from the frequency response
modulus thus calculated, in order to determine
the coefficients of the filter differentiated
according to the class of the speaker.
According to another characteristic, the first
processing unit comprises means of calculating the
partial cepstrum of the equalizer filter
according to the equation:

in which C^p_eq, C^p_x, C^p_S_RX and C^p_L_RX are the
respective partial cepstra of the adapted equalizer,
of the input signal x of the equalizer
filter, of the reception system and of the
reception line, C^p_ref being the reference partial
cepstrum, the center of the class of the
speaker; the modulus (EQ) restricted to the band F1-F2
is then calculated by discrete Fourier transform of
C^p_eq.

According to another characteristic, the first
processing unit comprises a sub-assembly for
calculating the coefficients of the partial cepstrum of
a speaker communicating and a second sub-assembly for
effecting the classification of this speaker, this
second sub-assembly comprising a unit for calculating
the pitch F0, a unit for estimating the mean pitch from
the calculated pitch F0, and a classification unit
applying a discriminating function to the vector x
having as its components the mean pitch and the
coefficients of the partial cepstrum for classifying
the said speaker.

According to the invention, the system also
comprises a pre-equalizer, the
signal equalized from reference spectra
differentiated according to the class of the speaker
being the output signal x of the pre-equalizer.

BRIEF DESCRIPTION OF THE DRAWINGS 

Other particularities and advantages of the
invention will emerge clearly from the following
description, which is given by way of illustrative and
non-limiting example and which is made with regard to
the accompanying figures, which show:

- Figure 1, a diagrammatic telephone connection
for a switched telephone network (STN),

- Figure 2, the transmission frequency response
curve of the modified intermediate reference system
IRS,

- Figure 3, the reception frequency response curve
of the modified intermediate reference system IRS,

- Figure 4, the frequency response of the
subscriber lines according to their length,

- Figure 5, the template of the anti-aliasing
filter of the MIC coder,

- Figure 6, the spectral distortions suffered by
the speech on the switched telephone network with
average IRS and various combinations of analogue lines,

- Figure 7, the transmission template for the
digital terminals,

- Figure 8, the reception template for the digital
terminals,

- Figure 9, the spectral distortion introduced by
GSM coding/decoding in EFR (Enhanced Full Rate) mode,

- Figure 10, the diagram of a communication
network with a system for correcting the speech
distortions,

- Figure 11, the steps of calculating the partial
cepstrum,

- Figure 12, the classification of the partial
cepstra according to the variance criterion,

- Figures 13a and 13b, the long-term spectra
corresponding to the centers of the classes of
speakers respectively for men and women,

- Figure 14, the frequency characteristics of the
filterings applied to the corpus in order to define the
learning corpus,

- Figure 15, the frequency response of the
pre-equalizer for various frequencies Fc,

- Figure 16, the scheme for implementing the
system of correction by differentiated equalization
per class of speaker,

- Figure 17, a variant execution of the system
according to Figure 16.

DETAILED DESCRIPTION OF THE DRAWINGS 

Throughout the following the same references
entered on the drawings correspond to the same
elements.
The description which follows will first of all
present the prior step of classification of a corpus of
speakers according to their long-term spectrum. This
step defines K classes and one reference per class.

A concatenation of processings makes it possible
to process the speech signal (as soon as a voice
activity is detected by the system) for each speaker in
order on the one hand to classify the speakers, that is
to say to allocate them to a class according to
predetermined criteria, and on the other hand to
correct the voice using the reference of the class of
the speaker.

Prior step of classification of the speakers.

* Choice of the class definition corpus.

The reference spectrum being an approximation of
the original long-term spectrum of the speakers, the
definition of the classes of speakers and their
respective reference spectra requires having available
a corpus of speakers recorded under non-degraded
conditions. In particular, the long-term spectrum of a
speaker measured on this recording must be able to be
considered to be the speaker's original spectrum, i.e.
that of the speaker's voice at the transmission end of
a telephone connection.

* Definition of the individual: the partial cepstrum

The processing proposed makes it possible to have
available, in each class, a reference spectrum as close
as possible to the long-term spectrum of each member of
the class. However, only the part of the spectrum
included in the equalization band F1-F2 is
taken into account in the adapted
equalization processing. The classes are therefore
formed according to the long-term spectrum restricted
to this band.

Moreover, the comparison between two spectra is
made at a low spectral resolution level, so as to
reflect only the spectral envelope. This is why the
space of the first cepstral coefficients of order
greater than 0 (the coefficient of order 0 representing
the energy) is preferably used, the choice of the
number of coefficients depending on the required
spectral resolution.

The "long-term partial cepstrum", which is denoted
C^p, is then determined in the processing as the
cepstral representation of the long-term spectrum
restricted to a frequency band. If the frequency
indices corresponding respectively to the frequencies
F1 and F2 are denoted k1 and k2 and the long-term
spectrum of the speech is denoted γ, the partial
cepstrum is defined by the equation:

$$C^p = \mathrm{TFD}^{-1}\Big(10 \log\big(\gamma(k_1 \ldots k_2) \circ \gamma(k_2 - 1 \ldots k_1 + 1)\big)\Big) \qquad (0.4)$$

where ∘ designates the concatenation operation.

The inverse discrete Fourier transform is
calculated for example by IFFT after interpolation of
the samples of the truncated spectrum so as to achieve
a number of samples that is a power of 2. For example,
by choosing the equalization band
187-3187 Hz, corresponding to the frequency indices 5
to 101 for a representation of the spectrum (made
symmetrical) on 256 points (from 0 to 255), the
interpolation is made simply by interposing a frequency
line (interpolated linearly) every three lines in the
spectrum restricted to 187-3187 Hz.

The steps of the calculation of the partial
cepstrum are shown in Figure 11.

For the cepstral coefficients to reflect the
spectral envelope but not the influence of the harmonic
structure of the spectrum of the speech on the
long-term spectra, the high-order coefficients are not
kept. The speakers to be classified are therefore
represented by the coefficients of orders 1 to L of
their long-term partial cepstrum, L typically being
equal to 20.
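The calculation of the partial cepstrum can be sketched as below, under stated assumptions: the logarithm is taken in base 10 (so that 10 log is a level in dB), the interpolation to a power-of-2 length is omitted, and a naive inverse DFT replaces the IFFT so that only the standard library is needed. The function name `partial_cepstrum` and its parameters are hypothetical.

```python
import cmath
import math


def partial_cepstrum(gamma, k1, k2, n_coeffs):
    """Coefficients of orders 0..n_coeffs of the partial cepstrum.

    gamma is the long-term power spectrum (strictly positive values);
    k1 and k2 are the frequency indices bounding the equalization band.
    """
    band = list(gamma[k1:k2 + 1])
    # gamma(k1..k2) concatenated with gamma(k2-1..k1+1): symmetrical spectrum.
    symmetric = band + band[-2:0:-1]
    log_spectrum = [10.0 * math.log10(value) for value in symmetric]
    n = len(log_spectrum)
    cepstrum = []
    for q in range(n_coeffs + 1):
        acc = sum(log_spectrum[k] * cmath.exp(2j * math.pi * q * k / n)
                  for k in range(n))
        cepstrum.append((acc / n).real)  # imaginary part vanishes by symmetry
    return cepstrum


# A flat spectrum of level 10 (i.e. 10 dB) has all its energy in order 0.
ceps = partial_cepstrum([10.0] * 20, k1=3, k2=11, n_coeffs=4)
speaker_vector = ceps[1:]  # orders 1..L are the ones kept for classification
```

Dropping order 0 makes the representation independent of the overall level, so two recordings of the same speaker at different volumes map to the same point.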

* The classification.

The classes are formed for example in a
non-supervised manner, according to an ascending
hierarchical classification.

This consists of creating, from N separate
individuals, a hierarchy of partitionings according to
the following process: at each step, the two closest
elements are aggregated, an element being either a
non-aggregated individual or an aggregate of
individuals formed during a previous step. The
proximity between two elements is determined by a
measurement of dissimilarity which is called distance.
The process continues until the whole population is
aggregated. The hierarchy of partitionings thus created
can be represented in the form of a tree like the one
in Figure 12, containing N-1 nested partitionings.
Each cut of the tree supplies a partitioning, which is
all the finer, the lower the cut.

In this type of classification, as a measurement
of distance between two elements, the intra-class
inertia variation resulting from their aggregation is
chosen. A partitioning is in fact all the better, the
more homogeneous are the classes created, that is to
say the lower the intra-class inertia. In the case of a
cloud of points x_i with respective masses m_i,
distributed in classes q with respective
centers of gravity g_q, the intra-class inertia is
defined by:

$$I_{intra} = \sum_{q} \sum_{i \in q} m_i \,\lVert x_i - g_q \rVert^2$$

The intra-class inertia, zero at the initial step
of the calculation algorithm, inevitably increases with
each aggregation.

Use is preferably made of the known principle of
aggregation according to variance. According to this
principle, at each step of the algorithm used, the two
elements are sought whose aggregation produces the
lowest increase in intra-class inertia.
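Aggregation according to variance is commonly known as Ward's criterion: merging two elements of masses m_a and m_b and centers of gravity g_a and g_b increases the intra-class inertia by m_a m_b / (m_a + m_b) times the squared distance between g_a and g_b. The sketch below applies it naively to unit-mass points; the function names, the exhaustive pair search and the stopping rule (a fixed number of classes rather than a cut chosen on the tree) are illustrative assumptions.

```python
def inertia_increase(mass_a, center_a, mass_b, center_b):
    """Increase in intra-class inertia caused by merging two elements."""
    dist2 = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    return mass_a * mass_b / (mass_a + mass_b) * dist2


def ward_classes(points, n_classes):
    """Ascending hierarchical classification stopped at n_classes classes."""
    # Each element: (mass, center of gravity, indices of its individuals).
    elements = [(1.0, list(p), [i]) for i, p in enumerate(points)]
    while len(elements) > n_classes:
        best = None
        for i in range(len(elements)):
            for j in range(i + 1, len(elements)):
                cost = inertia_increase(elements[i][0], elements[i][1],
                                        elements[j][0], elements[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, gi, ii), (mj, gj, ij) = elements[i], elements[j]
        mass = mi + mj
        center = [(mi * a + mj * b) / mass for a, b in zip(gi, gj)]
        elements = [e for k, e in enumerate(elements) if k not in (i, j)]
        elements.append((mass, center, ii + ij))
    return sorted(sorted(members) for _, _, members in elements)


classes = ward_classes([(0, 0), (0, 1), (10, 10), (10, 11)], n_classes=2)
```

In the processing described here the points would be the vectors of cepstral coefficients of orders 1 to L of each speaker in the corpus.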

The partitioning thus obtained is improved by a
procedure of aggregation around the movable
centers, which reduces the intra-class variance.
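Aggregation around movable centers can be sketched as an alternation between recomputing the center of gravity of each class and reallocating each individual to the nearest center, in the manner of the k-means procedure. The function name `refine_around_centers`, the iteration limit and the handling of a class that becomes empty (its previous center is kept) are assumptions made for illustration.

```python
def refine_around_centers(points, labels, max_iter=20):
    """Reallocate points to the nearest class center until the labels settle."""
    labels = list(labels)
    n_classes = max(labels) + 1
    centers = [None] * n_classes
    for _ in range(max_iter):
        for c in range(n_classes):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied class keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = []
        for p in points:
            dists = [
                float("inf") if g is None
                else sum((a - b) ** 2 for a, b in zip(p, g))
                for g in centers
            ]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# The last point starts in the wrong class and is moved to class 0.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0)]
refined = refine_around_centers(points, [0, 0, 1, 1, 1])
```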

The reference spectrum, on the band F1-F2,
associated with each class is calculated by Fourier
transform of the center of the class.

20 

* Example of classification. 

The processing described above is applied to a
corpus of 63 speakers. The classification tree of the
corpus is shown in Figure 12. In this representation,
the height of a horizontal segment aggregating two
elements is chosen so as to be proportional to their
distance, which makes it possible to display the
proximity of the elements grouped together in the same
class. This representation facilitates the choice of
the level of cutoff of the tree and therefore of the
classes adopted. The cutoff must be made above the
low-level aggregations, which group together close
individuals, and below the high-level aggregations,
which associate clearly distinct groups of individuals.

In this way, four classes are clearly obtained
(K = 4). These classes are very homogeneous from the
point of view of the sex of the speakers, and a division
of the tree into two classes shows approximately one
class of men and one class of women.

The consolidation of this partitioning by means of
an aggregation procedure around the movable centers
results in four classes of respective sizes 11, 18, 18
and 16, more homogeneous than before from the point of
view of the sex: only one man and two women are
allocated to classes not corresponding to their sex.

The spectra restricted to the 187-3187 Hz band
corresponding to the centers of these classes are shown
in Figures 13a and 13b for the men and women classes as
well as for their respective sub-classes. These
spectra, the results of the classification, are used as
a multiple reference by the adapted equalizer.

* Use of classification criteria for the speakers 

The classes of speakers being defined, the
processing provides for the use of parameters and
criteria for allocating a speaker to one or other of
the classes.

This allocation is not carried out simply
according to the proximity of the partial cepstrum to
one of the class centers, since this cepstrum is
biased by the part of the telephone connection
upstream of the equalizer.

It is advantageously proposed to use
classification criteria which are robust to this
bias. This robustness is ensured both by the choice
of the classification parameters and by that of the
learning corpus for the classification criteria.

* Preferably, the average pitch and the partial
cepstrum are used as classification parameters

The classes previously defined are homogeneous
from the point of view of the sex. The average pitch
is both fairly discriminating for a man/woman
classification and insensitive to the spectral
distortions caused by a telephone connection; it is
therefore used as a classification parameter jointly
with the partial cepstrum.

* Choice of the classification criteria learning
corpus

A discrimination technique is applied to these
parameters, for example the usual technique of
linear discriminant analysis.

Other known techniques can be used, such as a
non-linear technique using a neural network.

If N individuals are available, described by
vectors of dimension p and distributed a priori in K
classes, linear discriminant analysis consists of:

- firstly, seeking the K-1 independent linear
functions which best separate the K classes. It is a
case of determining which linear combinations of the
p components of the vectors minimise the intra-class
variance and maximise the inter-class variance;

- secondly, determining the class of a new
individual by applying the linear discriminant
functions to the vector representing him.

In the present case, the vectors representing the
individuals have as their components the pitch and the
coefficients 1 to L (typically L = 20) of the partial
cepstrum. The robustness of the discriminant
functions to the deviation of the cepstral coefficients
is ensured both by the presence of the pitch in the
parameters and by the choice of the learning corpus.
The latter is composed of individuals whose original
voice has undergone a great diversity of filtering
representing distortions caused by the telephone
connections.

More precisely, from a corpus of original voices
(non-degraded) of N speakers, there is defined a corpus
of N vectors of components $[F_0; C_p(1); \ldots; C_p(L)]$,
with $F_0$ the mean pitch and $C_p$ the partial cepstrum.
The construction of the learning corpus of the said
functions consists of defining a set of M cepstral
biases which are each added to each partial cepstrum
representing a speaker in the original corpus, which
makes it possible to obtain a new corpus of NM
individuals.

These biases in the domain of the partial cepstrum
correspond to a wide range of spectral distortions of
the band F1-F2, close to those which may result from
the telephone connection.

By way of example, the set of frequency responses
depicted in Figure 14 is proposed for the 187-3187 Hz
band: each frequency response corresponds to a path
from left to right in the lattice. The amplitude of
their variations on this band does not exceed 20 dB,
as is the case for the extreme characteristics of the
transmission and line systems.

From these 81 frequency characteristics, the 81
corresponding biases in the domain of the partial
cepstrum are calculated, according to the processing
described for the use of equation (0.4). By the
addition of these biases to the corpus of 63 speakers
previously used, a learning corpus is obtained
including 5103 individuals representing various
conditions (speaker, filtering of the connection).
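The construction of the learning corpus can be sketched as follows. The function name `build_learning_corpus` and the numerical values of the toy speakers and biases are placeholders introduced for the example; only the sizes (63 speakers, 81 biases, L = 20) are taken from the description. Each bias is added to the partial cepstrum of each speaker, the pitch being left unchanged since it is insensitive to the filtering of the connection.

```python
def build_learning_corpus(speakers, biases):
    """speakers: list of (mean_pitch, partial_cepstrum) pairs.
    biases: list of cepstral bias vectors of the same length as the cepstra.
    Returns the N*M vectors [F0, C_p(1), ..., C_p(L)] of the learning corpus."""
    corpus = []
    for pitch, cepstrum in speakers:
        for bias in biases:
            # The bias shifts the cepstral coefficients only.
            corpus.append([pitch] + [c + b for c, b in zip(cepstrum, bias)])
    return corpus


# Placeholder data with the sizes of the example: N = 63, M = 81, L = 20.
L = 20
toy_speakers = [(100.0 + n, [0.0] * L) for n in range(63)]
toy_biases = [[0.01 * m] * L for m in range(81)]
toy_corpus = build_learning_corpus(toy_speakers, toy_biases)
```

With these sizes the corpus contains 63 x 81 = 5103 vectors of dimension 21.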

In the case of classification by linear
discriminant analysis:

* Application of the classification criteria

Let $(a_k)_{1 \le k \le K-1}$ be the family of linear
discriminant functions defined from the learning
corpus. A speaker represented by the vector
$x = [F_0; C_p(1); \ldots; C_p(L)]$ is allocated to the
class q if the conditional probability of q knowing
a(x), denoted P(q|a(x)), is maximum, a(x) designating
the vector of components $(a_k(x))_{1 \le k \le K-1}$.
According to Bayes' theorem,

$$P(q \mid a(x)) = \frac{P(a(x) \mid q)\, P(q)}{P(a(x))}$$

Consequently P(q|a(x)) is proportional to
P(a(x)|q) P(q). In the subspace generated by the K-1
discriminant functions, on the assumption of a
multi-Gaussian distribution of the individuals in each
class, the density of probability of a(x) within the
class q is:

$$f_q(x) = \frac{1}{(2\pi)^{(K-1)/2}\, |S_q|^{1/2}} \exp\Bigl(-\tfrac{1}{2}\bigl(a(x) - a(\bar{x}^q)\bigr)^{T} S_q^{-1} \bigl(a(x) - a(\bar{x}^q)\bigr)\Bigr), \qquad (0.7)$$

where $\bar{x}^q$ is the center of the class q, $|S_q|$
designates the determinant of the matrix $S_q$, and $S_q$
is the matrix of the covariances of a within the class
q, of generic element $\sigma^q_{jk}$, which can be
estimated by:



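$$\sigma^q_{jk} = \frac{1}{N_q} \sum_{i \in q} \bigl(a_j(x_i) - a_j(\bar{x}^q)\bigr)\bigl(a_k(x_i) - a_k(\bar{x}^q)\bigr), \qquad (0.8)$$

where $N_q$ is the number of individuals $x_i$ of the
learning corpus in the class q.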



The individual x will be allocated to the class q
which maximises $f_q(x)P(q)$, which amounts to
minimising on q the function $s_q(x)$, also referred to
as the discriminating score:

$$s_q(x) = \bigl(a(x) - a(\bar{x}^q)\bigr)^{T} S_q^{-1} \bigl(a(x) - a(\bar{x}^q)\bigr) + \log\bigl(|S_q|\bigr) - 2\log\bigl(P(q)\bigr). \qquad (0.9)$$
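The allocation by minimum discriminating score can be sketched as follows, assuming the score is the sum of a Mahalanobis term, the logarithm of the determinant of the covariance matrix and minus twice the logarithm of the prior probability. The function names, the small Gaussian-elimination helper and the two toy classes are hypothetical and serve only to make the example self-contained.

```python
import math


def solve_and_logdet(matrix, rhs):
    """Solve matrix @ y = rhs by Gaussian elimination with partial pivoting
    and return (y, log of the absolute value of the determinant)."""
    n = len(rhs)
    a = [list(row) + [r] for row, r in zip(matrix, rhs)]
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        log_det += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        y[r] = (a[r][n] - sum(a[r][c] * y[c] for c in range(r + 1, n))) / a[r][r]
    return y, log_det


def discriminant_score(a_x, center, cov, prior):
    """Mahalanobis term + log-determinant - 2 log(prior), for one class."""
    diff = [u - v for u, v in zip(a_x, center)]
    y, log_det = solve_and_logdet(cov, diff)
    mahalanobis = sum(d * v for d, v in zip(diff, y))
    return mahalanobis + log_det - 2.0 * math.log(prior)


def classify(a_x, models):
    """models: list of (center, covariance, prior), all expressed in the
    subspace of the discriminant functions. Returns the index of the class
    of minimum score."""
    scores = [discriminant_score(a_x, c, s, p) for c, s, p in models]
    return min(range(len(models)), key=scores.__getitem__)


# Hypothetical two-class example in a two-dimensional discriminant subspace.
toy_models = [
    ([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], 0.5),
    ([4.0, 4.0], [[2.0, 0.0], [0.0, 2.0]], 0.5),
]
```

The inputs are assumed to be already projected by the discriminant functions; the learning of these functions themselves is not covered by the sketch.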



The correction method proposed is implemented by
the correction system (equalizer) located in the
digital network 40 as illustrated in Figure 10.

Figure 16 illustrates the correction system able
to implement the method. Figure 17 illustrates this
system according to a variant embodiment as will be
detailed hereinafter. These variants relate to the
method of calculating the modulus of the frequency
response of the adapted equalizer restricted to the
band F1-F2.

The pre-equalizer 200 is a fixed filter whose
frequency response, on the band F1-F2, is the inverse
of the global response of the analogue part of an
average connection as defined previously (UIT-T/P.830,
1996).

The steepness of the frequency response of this
filter implies a long impulse response; this is why, so
as to limit the delay introduced by the processing, the
pre-equalizer is typically produced in the form of an
IIR filter, of 20th order for example.

Figure 15 shows the typical frequency responses of
the pre-equalizer for three values of F1. The spread of
the group delays is less than 2 ms, so that the
resulting phase distortion is not perceptible.

The processing chain 400 which follows allows
classification of the speaker and differentiated
matched equalization. This chain comprises two
processing units 400A and 400B. The unit 400A makes it
possible to calculate the modulus of the frequency
response of the equalizer filter restricted to the
equalization band: |EQ|dB(F1-F2).






The second unit 400B makes it possible to
calculate the impulse response of the equalizer filter
in order to obtain the coefficients eq(n) of the
differentiated filter according to the class of the
speaker.

A voice activity frame detector 401 triggers the
various processings.

The processing unit 410 allows classification of
the speaker.

The processing unit 420 calculates the long-term
spectrum followed by the calculation of the partial
cepstrum of this speaker.

The output of these two units is applied to the
operator 428a or 428b. The output of this operator
supplies the modulus in dB of the frequency response of
the matched equalizer restricted to the equalization
band F1-F2, via the unit 429 for 428a and via the unit
440 for 428b.

The processing units 430 to 435 calculate the
coefficients eq(n) of the filter.

The output x(n) of the pre-equalizer is analysed by
successive frames with a typical duration of 32 ms,
with an interframe overlap of typically 50%. For this
purpose an analysis window represented by the blocks
402 and 403 is opened.
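As an illustration, the division into frames can be sketched as follows. The sampling frequency of 8 kHz is an assumption (usual value for the telephone band, not stated at this point of the description), and the function name `split_into_frames` is hypothetical. No weighting of the samples is applied, the shape of the analysis window not being specified here.

```python
def split_into_frames(signal, sample_rate_hz=8000, frame_ms=32, overlap=0.5):
    """Cut the signal into frames of frame_ms milliseconds, two consecutive
    frames sharing the given proportion of their samples."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    hop = int(frame_len * (1.0 - overlap))
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


# 1024 samples at 8 kHz: frames of 256 samples starting every 128 samples.
toy_frames = split_into_frames(list(range(1024)))
```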

The matched equalization operation is implemented
by an FIR filter 300 whose coefficients are calculated
at each voice activity frame by the processing chain
illustrated in Figures 16 and 17.

The calculation of these coefficients corresponds
to the calculation of the impulse response of the
filter from the modulus of the frequency response.

The long-term spectrum of x(n), $\gamma_x$, is first
of all calculated (as from the initial moment of
functioning) on a time window increasing from 0 to a
voice activity duration T (typically 4 seconds), and
then adjusted recursively at each voice activity frame,
which is represented by the following generic formula:

$$\gamma_x(f,n) = \alpha(n)\,|X(f,n)|^2 + \bigl(1-\alpha(n)\bigr)\,\gamma_x(f,n-1), \qquad (0.10)$$

where $\gamma_x(f,n)$ is the long-term spectrum of x at
the n-th voice activity frame, X(f,n) the Fourier
transform of the n-th voice activity frame, and
$\alpha(n)$ is defined by equation (0.11). Denoting N the
number of frames in the period T,

$$\alpha(n) = \frac{1}{\min(n, N)}. \qquad (0.11)$$

This calculation is carried out by the units 421,
422, 423.
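A minimal sketch of this update is given below, assuming that the recursion is a weighted average of the power spectrum of the current frame and of the previous long-term spectrum, with a weight equal to 1/min(n, N). The function names and the toy values are hypothetical. As long as n does not exceed N the result is the plain average of the frames received so far (increasing window); beyond that, the oldest frames are progressively forgotten.

```python
def update_long_term_spectrum(gamma_prev, frame_power, n, n_frames_in_t):
    """One update of the long-term spectrum. n is the 1-based index of the
    voice activity frame; frame_power holds |X(f, n)|^2 for each frequency."""
    alpha = 1.0 / min(n, n_frames_in_t)
    return [alpha * p + (1.0 - alpha) * g
            for p, g in zip(frame_power, gamma_prev)]


def long_term_spectrum(frame_powers, n_frames_in_t):
    """Run the recursion over a sequence of frame power spectra."""
    gamma = [0.0] * len(frame_powers[0])
    for n, power in enumerate(frame_powers, start=1):
        gamma = update_long_term_spectrum(gamma, power, n, n_frames_in_t)
    return gamma


# Hypothetical power spectra over two frequency bins and three frames.
toy_powers = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
toy_gamma_growing = long_term_spectrum(toy_powers, n_frames_in_t=250)
toy_gamma_forgetting = long_term_spectrum(toy_powers, n_frames_in_t=2)
```

The value of 250 frames corresponds to T = 4 s with frames starting every 16 ms, which is itself an assumption derived from 32 ms frames with 50% overlap.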

Next, the partial cepstrum Cp is calculated from
this long-term spectrum according to equation (0.4), by
the processing units 424, 425, 426.

The mean pitch $\bar{F}_0$ is estimated by the
processing unit 412 at each voiced frame according to
the formula:

$$\bar{F}_0(m) = \alpha(m)\,F_0(m) + \bigl(1-\alpha(m)\bigr)\,\bar{F}_0(m-1), \qquad (0.12)$$

where $F_0(m)$ is the pitch of the m-th voiced frame
and is calculated by the unit 411 according to an
appropriate method of the prior art (for example the
autocorrelation method, with determination of the
voicing by comparison of the normalized
autocorrelation with a threshold (UIT-T/G.729, 1996)).
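The following sketch illustrates a generic autocorrelation pitch estimator with a voicing threshold, together with the recursive averaging of the pitch. It is not the procedure of the cited recommendation; the search range of 60 to 400 Hz, the threshold of 0.5, the sampling frequency of 8 kHz and the function names are all assumptions made for the example.

```python
import math


def estimate_pitch(frame, sample_rate_hz=8000, f_min=60.0, f_max=400.0,
                   threshold=0.5):
    """Return the pitch in Hz, or None when the frame is judged unvoiced."""
    lag_min = int(sample_rate_hz / f_max)
    lag_max = min(int(sample_rate_hz / f_min), len(frame) - 1)
    best_lag, best_r = None, -1.0
    for lag in range(lag_min, lag_max + 1):
        head, tail = frame[:len(frame) - lag], frame[lag:]
        energy = math.sqrt(sum(u * u for u in head) * sum(v * v for v in tail))
        if energy == 0.0:
            continue
        # Normalized autocorrelation at this lag, between -1 and 1.
        r = sum(u * v for u, v in zip(head, tail)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < threshold:
        return None
    return sample_rate_hz / best_lag


def update_mean_pitch(mean_prev, pitch, alpha):
    """Recursive average of the pitch over the voiced frames."""
    return alpha * pitch + (1.0 - alpha) * mean_prev


# Hypothetical voiced frame: a 100 Hz sinusoid sampled at 8 kHz.
toy_voiced = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0) for n in range(256)]
toy_pitch = estimate_pitch(toy_voiced)
```

Restricting the search to lags below 8000/60 samples keeps the second period of the toy sinusoid out of range, which avoids an octave error in this simple case; a practical estimator would need a more robust safeguard.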

Thus, at each voice activity frame, there is a new
vector x whose components are the mean pitch and the
coefficients 1 to L of the partial cepstrum, to which
there is applied the discriminant function a defined
from the learning corpus. This processing is
implemented by the unit 413. The speaker is then
allocated to the class q of minimum discriminating
score.

The modulus in dB of the frequency response of the
matched equalizer restricted to the band F1-F2,
denoted |EQ|dB(F1-F2), is calculated according to one
of the following two methods:

The first method (Figure 16) consists of
calculating |EQ|F1-F2 according to equation (0.3), where
$\gamma_{ref}(f)$ is the reference spectrum of the class
of the speaker (Fourier transform of the class center).
This calculation method is implemented in the variant
depicted in Figure 16 with the operators 414a, 428a,
427 and 429.

The second method (Figure 17) consists of
transcribing equation (0.3) into the domain of the
partial cepstrum, since the partial cepstrum of the
output x of the pre-equalization, necessary for the
classification of the speaker, is already available.
Thus equation (0.3) becomes:

$$C_p^{eq} = C_p^{ref} - C_p^{x} - C_p^{S\_RX} - C_p^{L\_RX}, \qquad (0.13)$$

where $C_p^{eq}$, $C_p^{x}$, $C_p^{S\_RX}$ and $C_p^{L\_RX}$ are
the respective partial cepstra of the matched
equalizer, of the output x of the pre-equalizer, of the
reception system and of the reception line, $C_p^{ref}$
being the reference partial cepstrum, the center of
the class of the speaker. The partial cepstra are
calculated as indicated before, selecting the frequency
band F1-F2. This calculation is made solely for the
coefficients 1 to 20, the following coefficients being
unnecessary since they represent a fine spectral detail
which will be eliminated subsequently.

The 20 coefficients of the partial cepstrum of the
matched equalizer are obtained by the operators 414b
and 428b according to equation (0.13).

The processing unit 441 supplements these 20
coefficients with zeros, makes them symmetrical and
calculates, from the vector thus formed, the modulus in
dB of the frequency response of the matched equalizer
restricted to the band F1-F2 using the following
equation:

$$EQ_{dB(F_1-F_2)} = \mathrm{TFD}^{-1}\bigl(C_p^{eq}\bigr). \qquad (0.14)$$

This response is decimated by a factor of % by the
operator 442.

For the two variants which have just been
described, the values of |EQ| outside the band F1-F2
are calculated by linear extrapolation of the value in
dB of |EQ|F1-F2, denoted EQdB hereinafter, by the unit
430 in the following manner:

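For each index of frequency k, the linear
approximation of EQdB is expressed by:

$$\widehat{EQ}_{dB}(k) = a_1 k + a_2. \qquad (0.15)$$

The coefficients a1 and a2 are chosen so as to
minimise the square error of the approximation on the
range F1-F2, defined by

$$e = \sum_{k=k_1}^{k_2} \bigl(EQ_{dB}(k) - \widehat{EQ}_{dB}(k)\bigr)^2. \qquad (0.16)$$

The coefficients a1 and a2 are therefore defined
by:

$$\begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} \sum_{k=k_1}^{k_2} k^2 & \sum_{k=k_1}^{k_2} k \\ \sum_{k=k_1}^{k_2} k & k_2 - k_1 + 1 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{k=k_1}^{k_2} k\, EQ_{dB}(k) \\ \sum_{k=k_1}^{k_2} EQ_{dB}(k) \end{pmatrix}. \qquad (0.17)$$

The values of |EQ|, in dB, outside the band F1-F2,
are then calculated from the formula (0.15).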
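The extrapolation can be sketched as follows: a straight line is fitted in the least-squares sense to the values in dB inside the band, and this line supplies the values outside the band. The function name `extrapolate_outside_band`, its argument layout and the toy values are assumptions made for the example.

```python
def extrapolate_outside_band(eq_db_band, k1, n_bins):
    """eq_db_band holds the values in dB for the frequency indices
    k1, k1 + 1, ..., k1 + len(eq_db_band) - 1 (at least two values).
    Returns n_bins values: the original ones inside the band, and the
    least-squares line outside it."""
    k2 = k1 + len(eq_db_band) - 1
    ks = range(k1, k2 + 1)
    n = len(eq_db_band)
    sum_k = sum(ks)
    sum_kk = sum(k * k for k in ks)
    sum_y = sum(eq_db_band)
    sum_ky = sum(k * y for k, y in zip(ks, eq_db_band))
    # Closed-form solution of the two normal equations of the line fit.
    slope = (n * sum_ky - sum_k * sum_y) / (n * sum_kk - sum_k * sum_k)
    intercept = (sum_y - slope * sum_k) / n
    return [eq_db_band[k - k1] if k1 <= k <= k2 else slope * k + intercept
            for k in range(n_bins)]


# Hypothetical band covering the indices 1 to 3 of a 6-bin response.
toy_response = extrapolate_outside_band([2.0, 4.0, 6.0], k1=1, n_bins=6)
```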

The frequency characteristic thus obtained must be
smoothed. Since the filtering is performed in the time
domain, this smoothing is achieved by multiplying the
corresponding impulse response by a narrow window.

The impulse response is obtained by an IFFT
operation applied to |EQ|, carried out by the units 431
and 432, followed by a symmetrization performed by the
processing unit 433, so as to obtain a linear-phase
causal filter. The resulting impulse response is
multiplied, by means of the operator 435, by a time
window 434. The window used is typically a Hamming
window of length 31 centered on the peak of the impulse
response.
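By way of illustration, the passage from the modulus of the frequency response to the filter coefficients can be sketched as a generic frequency-sampling design: zero-phase inverse transform, re-centering of the peak so that the filter is causal and symmetrical, then multiplication by a Hamming window of length 31. The exact division of these operations between the units 431 to 435 is not reproduced; the function name `design_fir`, the direct cosine-sum inverse transform and the test response are assumptions made for the example.

```python
import math


def design_fir(eq_db_half, n_taps=31):
    """eq_db_half: modulus in dB for the frequency indices 0 to n_fft/2
    (n_fft must be at least n_taps). Returns n_taps coefficients eq(n) of a
    linear-phase filter."""
    half = len(eq_db_half) - 1
    n_fft = 2 * half
    mag = [10.0 ** (v / 20.0) for v in eq_db_half]
    # Inverse DFT of a real, even, zero-phase spectrum (cosine sum).
    h = []
    for n in range(n_fft):
        acc = mag[0] + mag[half] * math.cos(math.pi * n)
        acc += 2.0 * sum(mag[k] * math.cos(2.0 * math.pi * k * n / n_fft)
                         for k in range(1, half))
        h.append(acc / n_fft)
    # h is even modulo n_fft with its peak at n = 0: shift the peak to the
    # middle tap to obtain a causal, symmetrical impulse response.
    mid = n_taps // 2
    centred = [h[(n - mid) % n_fft] for n in range(n_taps)]
    # Hamming window centred on the peak.
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_taps - 1))
              for n in range(n_taps)]
    return [c * w for c, w in zip(centred, window)]


# Hypothetical flat 0 dB response on 33 bins: the filter reduces to a pure
# delay of 15 samples.
toy_taps = design_fir([0.0] * 33)
```

Shortening the impulse response to 31 windowed taps is what smooths the frequency characteristic, at the cost of a delay of 15 samples.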



ABSTRACT OF THE DISCLOSURE

A technique for correcting the voice spectral
deformations introduced by a communication network.
Prior to the operation of equalization of the voice
signal of a speaker, the constitution of classes of
speakers is communicated, with one voice reference per
class. Then, for a given speaker, the classification of
this speaker is communicated, that is to say his
allocation to a class from predefined classification
criteria in order to make a voice reference which is
closest to his own correspond to him. Then, for that
given speaker, the equalization of the digitized signal
of the voice of the speaker is communicated, this
equalization being carried out with, as a reference
spectrum, the voice reference of the class to which the
speaker has been allocated. This technique applies to
the correction of the timbre of the voice in switched
telephone networks, in ISDN networks and in mobile
networks.